NEW PROMISES IN READING BY LISTENING

by

Francis F. Lee, Ph.D.
Lexicon, Inc.
60 Turner Street
Waltham, Mass. 02154


Speech and writing are two forms of human communication. It is safe
to say that speech comes before writing, for there are still many
primitive peoples in this world who do not yet have a written form of
their spoken language. While writing can closely represent speech in
the essential semantic sense, we must recognize that speech and
writing each have their own distinct attributes.

Speech is produced in real time. The speaker "composes" the message
as he speaks. He may, of course, pause from time to time to search
for the words which best convey his idea. He raises and lowers his
voice, speeds up and slows down to give the speech just the shade of
meaning he wishes to impart. To those who know him, his voice is
easily recognized as his. Under many circumstances, he may add
gestures to help him communicate. The listener, for his part, accepts
the message in real time as it happens.

On the other hand, language in written or printed form takes much
longer to produce than to receive. Time is taken to revise and
re-revise the script, to set the type, to check the galley proof and
to print. To segment and link his ideas the writer has only the
punctuation marks, the grouping of sentences into paragraphs and the
grouping of paragraphs into chapters. The reader, for his part, can
read the material at a place, at a time and at a rate of his own
choice. He can skim or scan over some parts, and dwell on others.
Printed material can be produced and reproduced at extremely low
cost.


With recordings of human speech, we no longer have to accept the
message as it is originally produced. With easy-to-use and reasonably
priced cassette tape recorders now available everywhere, tape-recorded
speech has become a very popular medium of information exchange. To
the visually handicapped it is an alternative or a supplement to
Braille. To the general public, it saves the cost and time of putting
the message in written form. The listener, while listening to a
recording being played back, can still use his visual channel for
performing other tasks. Yet, because it is a speech recording, the
listener is essentially paced by the original rate of production. Any
attempt on his part to alter the tape speed to an appreciable extent
affects the quality of the sound and its intelligibility. We are all
familiar with the fact that when we speed up a recording we get a
Donald Duck-like sound, and when we slow down a recording we get an
unpleasant growl.
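The pitch change follows from simple proportionality: playing a tape
at k times the recorded speed multiplies every frequency on it by k
and divides its duration by k. A minimal sketch in Python, in which
the function name and the numbers are illustrative assumptions and
not taken from the talk:

```python
def play_at_speed(frequency_hz, duration_s, speed):
    """Return (frequency, duration) heard when a tape is simply run faster or slower.

    `speed` is the ratio of playback speed to recording speed
    (2.0 means twice as fast). Hypothetical helper for illustration.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    # Every frequency scales up with the speed; the running time scales down.
    return frequency_hz * speed, duration_s / speed


# Illustrative numbers: a 120 Hz voice pitch in a 60-second passage.
fast = play_at_speed(120.0, 60.0, 2.0)   # pitch doubled, time halved
slow = play_at_speed(120.0, 60.0, 0.5)   # pitch halved, time doubled
```

The time saving and the pitch shift are tied together here, which is
exactly what a time compressor-expander has to avoid.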

How can we give the listener of recorded speech the freedom to
control his own listening pace? We know from experience that recorded
lectures, for example, are in general too slow for the average
listener; hence, we wish to speed them up. We also know that when the
recording is noisy or has several simultaneous voices, slowing down
the playing speed can enhance comprehension. Therefore, we want the
freedom to control the listening rate in both directions at listening
time. What we want is a speech time compressor-expander machine.


Speech time compression-expansion is not new. In 1959, Fairbanks,
Everitt and Jaeger were issued a U.S. patent on a machine which
produces a normal-sounding output tape from a tape played at higher
or lower than normal tape speed (1). The technique used was to
discard or insert, periodically, segments of speech with durations on
the order of 10 to 80 milliseconds.
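The sampling method can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the speech is a list
of digital samples rather than magnetic tape, the function names and
the 30 ms defaults are hypothetical (chosen from within the 10 to 80
millisecond range mentioned above), and no smoothing is applied where
segments meet.

```python
def compress_by_sampling(samples, sample_rate, keep_ms=30.0, discard_ms=30.0):
    """Keep keep_ms of signal, discard the next discard_ms, and repeat."""
    keep = int(sample_rate * keep_ms / 1000)
    drop = int(sample_rate * discard_ms / 1000)
    if keep <= 0 or drop < 0:
        raise ValueError("keep_ms must be positive and discard_ms non-negative")
    out = []
    pos = 0
    while pos < len(samples):
        out.extend(samples[pos:pos + keep])   # retained segment, pitch untouched
        pos += keep + drop                    # skip over the discarded segment
    return out


def expand_by_repetition(samples, sample_rate, segment_ms=30.0, repeat_ms=30.0):
    """After every segment_ms of signal, replay its last repeat_ms."""
    seg = int(sample_rate * segment_ms / 1000)
    rep = int(sample_rate * repeat_ms / 1000)
    if seg <= 0 or rep < 0:
        raise ValueError("segment_ms must be positive and repeat_ms non-negative")
    out = []
    pos = 0
    while pos < len(samples):
        chunk = samples[pos:pos + seg]
        out.extend(chunk)
        if rep > 0:
            out.extend(chunk[-rep:])          # inserted (repeated) segment
        pos += seg
    return out
```

With equal keep and discard durations the output runs in about half
the original time; because the retained samples are played at their
original rate, the sound spectrum within each segment is not shifted.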

The basic mechanism used is a rotating magnetic head assembly with
four matched heads spaced 90 degrees apart. The tape wraps around the
rotating head assembly for 90 degrees. In compression, for example,
the head assembly rotates in the same direction as the tape motion.
The relative speed of the tape with respect to a head in contact with
it is maintained at the recording tape speed. Since the tape-to-head
relative speed is unchanged, the sound spectrum also remains
unchanged. But when one head leaves and a second head comes into the
contact zone, the segment of tape wrapped around the head assembly is
not scanned by either head and is effectively discarded. The figure
below illustrates the relationship of the tape and the head assembly
at two instants of time.


[Figure: the tape and the head assembly at two instants of time.
Panel labels: ONE HEAD IN CONTACT WITH TAPE; TWO HEADS IN CHANGE-OVER
POSITION.]
Speech compressor-expanders employing rotating magnetic heads have
been commercially produced in Europe and this country (2). The cost,
size and maintenance required put the machine out of reach of
individuals.

To understand why the sampling method introduced by Fairbanks and his
colleagues works, we have to look at the linguistic and phonological
side of speech.


Speech is made up of sound elements which linguists call phonemes.
There are consonant phonemes such as the "kuh" sound in "cat", and
vowel phonemes such as the "ah" sound in "father". Since phonemes are
produced one after another in a complex overlapping manner, it is not
possible to isolate them, although it is not too difficult to
determine the effective phoneme durations. The phoneme duration
varies from tens of milliseconds for some consonants to hundreds of
milliseconds for vowels (3). Successful speech time compression has
been achieved when segments of speech lasting a fraction of the
minimum phoneme duration are deleted, and successful speech time
expansion has been achieved when such segments are repeated. The
quality of the rate-altered speech depends on the care taken in
joining the segments together and on the smoothing done at the
joints.
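One way to smooth a joint, shown here purely as an assumption since
the talk does not say which smoothing is used, is a short linear
crossfade: the end of one retained segment fades out while the start
of the next fades in. The function names, and the choice of a linear
ramp, are hypothetical.

```python
def crossfade_join(left, right, overlap):
    """Join two segments, blending the last `overlap` samples of `left`
    with the first `overlap` samples of `right` on a linear ramp."""
    overlap = min(overlap, len(left), len(right))
    if overlap <= 0:
        return list(left) + list(right)
    start = len(left) - overlap
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)           # weight of `right`, rising toward 1
        blended.append((1 - w) * left[start + i] + w * right[i])
    return list(left[:start]) + blended + list(right[overlap:])


def compress_smoothed(samples, keep, drop, overlap):
    """Sampling-method compression (counts given in samples) with
    crossfaded joints instead of abrupt cuts."""
    out = []
    pos = 0
    while pos < len(samples):
        segment = samples[pos:pos + keep]
        out = crossfade_join(out, segment, overlap) if out else list(segment)
        pos += keep + drop
    return out
```

Each crossfade shortens the output by `overlap` samples, so the
overlap should be small compared with the retained segment. Without
it, the waveform can jump abruptly at every joint, which is heard as
a click.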

Speech consists of more than segmental phonemes. On the phrase and
sentence level there are the stress, intonation and rhythm we find so
important. Stress and intonation are carried on three acoustic
parameters, namely, voice pitch, vowel sound duration and intensity.
Selectively shortening vowel duration, while a possible way of
achieving speech time compression, would upset the stress and
intonation pattern.

How about pauses? Can we eliminate the pauses to achieve speech time
compression? The answer to this question depends on what kind of
pause it is. Obviously, the very long pauses during which the speaker
has nothing to say and the listener is not expected to think can be
completely eliminated. Short pauses of less than one half second are
usually either hesitation pauses or juncture pauses. Hesitation
pauses are used by most speakers to search their minds for the most
suitable words or phrases. From the listener's point of view they can
be eliminated. Juncture pauses, on the other hand, are placed at
phrase and sentence boundaries to serve as punctuation. Both the
speaker and the listener need the juncture pause time to process the
information mentally. If juncture pauses are eliminated, the
listener, while hearing every word of the speech, can miss a lot of
the meaning because he may not have enough time to digest the
overflowing information. Unfortunately, hesitation pauses and
juncture pauses span about the same range of duration and there is no
way to selectively delete one kind and not the other (4).

Deletion of pauses longer than about one half second in duration is
an effective way to compress speech tape recordings, unless such long
pauses are deliberately placed for listener interaction. A
twenty-four hour telephone wire-tapping recording, for example, can
benefit greatly from the long-pause deletion technique.
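Long-pause deletion can be sketched as follows. The talk does not
describe how pauses are detected, so the amplitude-threshold test on
short frames used here is an assumption, as are the function name,
the threshold value and the 10 ms frame length. Pauses at or below
the half-second limit are left alone, since hesitation and juncture
pauses cannot be told apart.

```python
from itertools import groupby


def delete_long_pauses(samples, sample_rate, min_pause_s=0.5, keep_s=0.0,
                       threshold=0.01, frame_ms=10.0):
    """Remove silent stretches longer than min_pause_s.

    A frame counts as silent when its peak amplitude is below `threshold`.
    `keep_s` is how much of each long pause to retain (0.0 deletes it all).
    """
    frame = max(1, int(sample_rate * frame_ms / 1000))
    min_len = int(sample_rate * min_pause_s)
    keep_len = int(sample_rate * keep_s)
    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]

    def is_silent(f):
        return max(abs(x) for x in f) < threshold

    out = []
    # Group consecutive frames into alternating runs of speech and silence.
    for silent, run in groupby(frames, key=is_silent):
        run_samples = [x for f in run for x in f]
        if silent and len(run_samples) > min_len:
            run_samples = run_samples[:keep_len]   # trim the long pause
        out.extend(run_samples)
    return out
```

Setting `keep_s` to a small positive value leaves a short gap where
each long pause stood, which may be preferable when the long pause
also marks a sentence boundary.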

In conclusion, I would like to demonstrate the Varispeech-I, an
electronic speech time compressor/expander manufactured by Lexicon,
Incorporated. It is reasonably compact, easy to operate and, thanks
to the elimination of rotating heads, very reliable. We hope that,
after hearing the demonstration, you can form your own opinion of
whether it brings some new promises in reading by listening.

References

1. U.S. Patent 2,886,650.

2. For example: Eltro Information Changer Mark II, manufactured by
   Eltro & Company, GmbH, Heidelberg, Germany, and Whirling Dervish,
   manufactured by Discerned Sound of North Hollywood, California.

3. I. Lehiste and G. E. Peterson, Transitions, Glides and Diphthongs,
   Journal of the Acoustical Society of America, Vol. 33, Number 3,
   March 1961, pp. 268-277.

4. D. S. Boomer and A. T. Dittmann, Hesitation Pauses and Juncture
   Pauses in Speech, Language and Speech, Vol. 5, Part 4, 1962,
   pp. 215-220.


(A recording of the above-mentioned demonstration is available from
the author. Ed.)




